Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian


Abstract

We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when the language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed word-based models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
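To make the data-sparsity argument concrete, here is a minimal sketch of a standard class-based bigram model, in which P(w | w_prev) is approximated as P(c | c_prev) * P(w | c). It is not the authors' implementation: the function names, the toy Finnish sentences and the class labels (`N-nom`, `N-ine`, `V`) are all assumptions made for illustration, the estimates are unsmoothed maximum likelihood, and the merging, splitting and exchange refinement described in the abstract is not shown.

```python
from collections import Counter

# Hypothetical word-to-class map; a real system would derive it from a
# morphological analyzer and then refine it. The labels are invented here.
WORD2CLASS = {
    "talo": "N-nom", "auto": "N-nom",        # nominative nouns
    "talossa": "N-ine", "autossa": "N-ine",  # inessive nouns
    "on": "V", "oli": "V",                   # verbs
}

# Toy training data; note that "autossa" never occurs in it.
SENTENCES = [
    ["talo", "on"],
    ["auto", "on"],
    ["talossa", "on"],
    ["talossa", "oli"],
]

def train_class_bigram(sentences, word2class):
    """Collect the counts needed for an unsmoothed class-based bigram model."""
    class_bigrams = Counter()  # counts of (previous class, current class)
    class_context = Counter()  # counts of a class appearing as left context
    word_counts = Counter()    # counts of each word
    class_counts = Counter()   # counts of each class
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        for prev, cur in zip(classes, classes[1:]):
            class_bigrams[(prev, cur)] += 1
            class_context[prev] += 1
    return class_bigrams, class_context, word_counts, class_counts

def class_bigram_prob(prev_word, word, word2class, model):
    """Return P(class | previous class) * P(word | class)."""
    class_bigrams, class_context, word_counts, class_counts = model
    c_prev, c = word2class[prev_word], word2class[word]
    if class_context[c_prev] == 0 or class_counts[c] == 0:
        return 0.0  # no evidence at all for this class context or class
    p_class = class_bigrams[(c_prev, c)] / class_context[c_prev]
    p_word = word_counts[word] / class_counts[c]
    return p_class * p_word

model = train_class_bigram(SENTENCES, WORD2CLASS)
# The word bigram ("autossa", "on") is unseen, yet the class model still
# assigns it probability mass because "talossa" shares the class N-ine.
p_unseen = class_bigram_prob("autossa", "on", WORD2CLASS, model)
```

Because the parameters are indexed by classes rather than word pairs, the model needs far fewer bigram statistics than a word-level model with millions of word types, which is the sparsity and computational benefit the abstract refers to.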


Similar resources

Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes

This paper describes development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large, in addition, the word order is relatively unconstrained. To achieve a good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...


Large Vocabulary Continuous Speech Recognition for Estonian Using Morphemes and Classes

This paper describes development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large, in addition, the word order is relatively unconstrained. To achieve a good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...


Estonian Large Vocabulary Speech Recognition System for Radiology

This paper describes implementation and evaluation of an Estonian large vocabulary continuous speech recognition system prototype for the radiology domain. We used a 44 million word corpus of radiology reports to build a word trigram language model. We recorded a test set of dictated radiology reports using ten radiologists. Using speaker independent speech recognition, we achieved a 9.8% word ...


Towards very large vocabulary word recognition

In this paper, preliminary considerations and some experimental results are presented in an effort to design Very Large Vocabulary Recognition (VLVR) systems. We will first consider the applicability of current recognition techniques and argue their inadequacy for VLVR. Possible alternate strategies will be explored and their potential usefulness statistically evaluated. Our results indicate t...


Vocabulary Decomposition for Estonian Open Vocabulary Speech Recognition

Speech recognition in many morphologically rich languages suffers from a very high out-of-vocabulary (OOV) ratio. Earlier work has shown that vocabulary decomposition methods can practically solve this problem for a subset of these languages. This paper compares various vocabulary decomposition approaches to open vocabulary speech recognition, using Estonian speech recognition as a benchmark. C...



Journal

Journal title: Computer Speech & Language

Year: 2021

ISSN: 1095-8363, 0885-2308

DOI: https://doi.org/10.1016/j.csl.2020.101141